Finding Kin-Pairs with Genetic Data
Eric C. Anderson
The Wildlife Society CKMR Workshop, Sunday November 6, 2022
Why not do full, joint pedigree reconstruction?
After all, doing so let’s you use all the available data
Cast the question as a statistical test applied to pairs:
Then, if it were the case that all pairs were either Unrelated or Parent-offspring
And, if we fixed our false positive rate (FPR) to be \(\alpha\)…
Then, amongst all possible statistical tests, the one with the highest power (lowest false negative rate) will be based on the likelihood ratio. \[ \frac{P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~\mathrm{PO})} {P(G_1, G_2~|~i~\mbox{and}~j~\mbox{are}~\mathrm{Unrelated})} \] Where \(G_1\) and \(G_2\) denote the observed genotypes in the two individuals of the pair.
This holds for any two relationships: \[ \frac{P(G_i, G_j~|~i~\mbox{and}~j~\mbox{are}~R_1)} {P(G_1, G_2~|~i~\mbox{and}~j~\mbox{are}~R_2)} \] is a most powerful test (Neyman-Pearson Lemma).
We previously showed we can use \(\boldsymbol{\kappa}(R)\) (the \(\kappa\) coefficients for relationship \(R\)) to calculate the joint probability \(P(G_1, G_2~|~i~\mbox{and}~j~\mbox{are}~R)\):
\[ \begin{aligned} P(G_i, G_j|\boldsymbol{\kappa}) &= \kappa_0 P(G_i)P(G_j) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{\tiny\mathrm{no~gene~copies~IBD}} \\ &+ \kappa_1 P(G_i)P(G_j | G_i=\mathrm{ma}, G_j=\mathrm{kid}) ~~~~~~~~~{\tiny\mathrm{1~gene~copy~IBD}} \\ &+ \kappa_2 P(G_i)\mathcal{I}\{P(G_j)=P(G_i)\} ~~~~~~~~~~~~~~~~~~~~~~~~~~{\tiny\mathrm{2~gene~copies~IBD}} \\ \end{aligned} \] where \(\mathcal{I}\{\cdot\}\) is an indicator function returning 1 when what is inside it is true, and 0 otherwise.
With \(L\) different loci, writing the genotype at the \(\ell^\mathrm{th}\) locus as \(G^{(\ell)}\) we can use: \[ P(\boldsymbol{G}_1, \boldsymbol{G}_2~|~\boldsymbol{\kappa}(R)) = \prod_{\ell = 1}^L P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}(R)) \] This is actually the correct probability so long as:
It is pretty easy to verify that (1) is the case—merely test for LD.
Assessing (2) is more difficult.
So, assumption (2) often gets made for this calculation, even when it is known to be wrong. (We come back to this later.)
Using that (approximate) joint probability of the genotypes we can compute the (approximate) likelihood ratio suggested by the NP-lemma: \[ \mathrm{For~PO~vs~Unrelated~pairs:}~~~~~~~~~~~~~~~~~~~\frac{P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}= (0, 1,0))} {P(\boldsymbol{G}_i, \boldsymbol{G}_j~|~\boldsymbol{\kappa}= (1, 0,0))} \]
\[ \Lambda_{PO/U} = \log\biggl(\frac{P(\boldsymbol{G}_1, \boldsymbol{G}_2~|~\boldsymbol{\kappa}= (0, 1,0))} {P(\boldsymbol{G}_1, \boldsymbol{G}_2~|~\boldsymbol{\kappa}= (1, 0,0))} \biggr) \]
\[ \begin{aligned} \Lambda_{PO/U} & = & \log\biggl( \prod_{\ell = 1}^L \frac{P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}= (0, 1, 0))} {P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}= (1, 0, 0))} \biggr) \\ & = & \sum_{\ell = 1}^L\log\biggl( \frac{P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}= (0, 1, 0))} {P(G_{1}^{(\ell)}, G_{2}^{(\ell)}~|~\boldsymbol{\kappa}= (1, 0, 0))} \biggr) \end{aligned} \]
Good point.
In our survey of genotyping technologies, we noted that they all can produce errors to some extent.
Errors are most problematic in assessing likelihoods for the PO relationship.
These probabilities can be computed in a way that accounts for genotyping error: \[ \Lambda_{PO/U} = \log\biggl(\frac{P(\boldsymbol{G}_1, \boldsymbol{G}_2~|~\boldsymbol{\mu}, \boldsymbol{\kappa}= (0, 1,0))} {P(\boldsymbol{G}_1, \boldsymbol{G}_2~|~\boldsymbol{\mu}, \boldsymbol{\kappa}= (1, 0,0))} \biggr) \]
Error model can be simple (or complex)
R package ‘CKMRsim’ (we will use it this afternoon) can accommodate all possible single-locus genotyping models.
probability that a true kin pair (PO in this case) has \(\Lambda \leq \Lambda_c\).
probability that a non kin pair (U in this case) has \(\Lambda > \Lambda_c\).
\[ \mathrm{FPR}_{\Lambda_c} = \sum_{\Lambda_{PO/U}}\mathcal{I}\{ \Lambda_{PO/U} > \Lambda_c\} P(\Lambda_{PO/U}~|~i,j~\mathrm{Unrelated}) \approx \frac{1}{M}\sum_{m=1}^M \mathcal{I}\{ \Lambda_{PO/U}^{(m)} > \Lambda_c\} \] with each \(\Lambda^{(m)}_{PO/U}\) simulated from its distribution assuming the pair is unrelated.
\[ \begin{aligned} \mathrm{FPR}_{\Lambda_c} &= \sum_{\Lambda_{PO/U}}\mathcal{I}\{ \Lambda_{PO/U} > \Lambda_c\} \frac{P(\Lambda_{PO/U}~|~i,j~\mathrm{Unrelated})} {P(\Lambda_{PO/U}~|~i,j~\mathrm{PO})} P(\Lambda_{PO/U}~|~i,j~\mathrm{PO}) \\ & \approx \frac{1}{M}\sum_{m=1}^M \biggl[\mathcal{I}\{ \Lambda_{PO/U}^{(m)} > \Lambda_c\} \frac{P(\Lambda^{(m)}_{PO/U}~|~i,j~\mathrm{Unrelated})} {P(\Lambda^{(m)}_{PO/U}~|~i,j~\mathrm{PO})}\biggr] \end{aligned} \] with each \(\Lambda^{(m)}_{PO/U}\) simulated from the importance sampling distribution, \(P(\Lambda^{(m)}_{PO/U}~|~i,j~\mathrm{PO})\).
Easy to accommodate this in CKMR models if you can estimate the FNR.
But how does it do that?
First, a digression into HS, A-N, and GP-GC, pairs.
Note that even if you don’t have a map, you can still compute \(\Lambda\). It is a valuable statistic even if it is not the correct probability.
Have scanned 10000 pairs in Mendel output file.
Have scanned 10000 pairs in Mendel output file.
Let’s compute the mean and variance of those:
# A tibble: 8 × 5
# Groups: simmed_from, numer_wts [8]
simmed_from numer_wts denom_wts mean_lograt var_lograt
<chr> <chr> <chr> <dbl> <dbl>
1 FS-linked FS=1 U=1 44.8 192.
2 FS-unlinked FS=1 U=1 44.7 117.
3 HS-linked HS=1 U=1 12.3 40.5
4 HS-unlinked HS=1 U=1 12.2 29.8
5 U-linked FS=1 U=1 -36.9 64.0
6 U-linked HS=1 U=1 -10.3 18.1
7 U-unlinked FS=1 U=1 -37.0 67.0
8 U-unlinked HS=1 U=1 -10.3 18.7
And that extra variance is causes the power to go down:
For unlinked:
# A tibble: 5 × 10
FNR FPR se num_no…¹ Lambd…² pstar mc_me…³ numer…⁴ denom…⁵ true_…⁶
<dbl> <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 0.01 3.69e-12 3.96e-13 9900 20.7 FS IS FS U U
2 0.05 1.53e-14 9.52e-16 9500 27.3 FS IS FS U U
3 0.1 5.74e-16 2.76e-17 9000 31.2 FS IS FS U U
4 0.2 1.10e-17 4.26e-19 8000 35.6 FS IS FS U U
5 0.3 4.87e-19 1.83e-20 7000 38.7 FS IS FS U U
# … with abbreviated variable names ¹num_nonzero_wts, ²Lambda_star, ³mc_method,
# ⁴numerator, ⁵denominator, ⁶true_relat
For linked:
# A tibble: 5 × 10
FNR FPR se num_no…¹ Lambd…² pstar mc_me…³ numer…⁴ denom…⁵ true_…⁶
<dbl> <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 0.01 1.16e- 9 3.15e-10 9989 12.9 FS IS FS U U
2 0.05 1.16e-12 1.08e-13 9850 22.3 FS IS FS U U
3 0.1 1.92e-14 1.23e-15 9525 27.0 FS IS FS U U
4 0.2 1.09e-16 4.69e-18 8639 33.0 FS IS FS U U
5 0.3 2.01e-18 7.59e-20 7465 37.3 FS IS FS U U
# … with abbreviated variable names ¹num_nonzero_wts, ²Lambda_star, ³mc_method,
# ⁴numerator, ⁵denominator, ⁶true_relat